ABSTRACT

With the advent of very large redshift surveys of tens to hundreds of thousands
of galaxies, reliable techniques for automatically determining galaxy redshifts are
becoming increasingly important. The technique most commonly used at present
is the cross-correlation of a galactic spectrum with a set of templates. This series of
papers presents a new method based on Principal Component Analysis. The method 
generalizes the cross-correlation approach by replacing the individual templates by a 
simultaneous linear combination of orthogonal templates. This effectively eliminates 
the mismatch between templates and data and provides for the possibility of better 
error estimates. In this paper, the first of a series, the basic mathematics are presented 
along with a simple demonstration of the application.

1. INTRODUCTION

The development of fiber-based spectrographs 
capable of observing hundreds of objects simulta- 
neously has led to the advent of many large red- 
shift surveys with the intention of furthering our 
understanding of the large scale structure, clus- 
tering and evolution of galaxies. (For a review see 
Strauss 1996.) Examples include the Las Cam- 
panas Redshift Survey of 26,000 galaxies, just 
completed (Shectman et al. 1996), and the Two 
Degree Field (2dF) Redshift Survey (Taylor et al. 
1997, Maddox et al. 1997) which has started this 
year (1997) and will measure the redshifts of over 
250,000 galaxies over the next several years. 

Because of the sheer size of these surveys it 
is becoming very important to develop methods 
of reliably, and quantifiably, measuring the red- 
shifts of the galaxy spectra without manual in- 
tervention. For example in the 2dF survey a 
method with a 95% success rate would still leave 
12,500 spectra to be inspected manually, a very 
large task. Ideally any automatic redshift cal- 
culation should also give an accurate error esti- 
mate and confidence rating for each redshift to 
indicate which 12,500 galaxies out of the 250,000 
need further, possibly manual, attention. 

At the current time the most successful and
widespread method of automatic redshift measurement
is cross-correlation analysis (Tonry &
Davis 1979). In this method the galaxy spectrum
is cross-correlated with a series of template
spectra corresponding to a sequence of standard 
galaxy or stellar types. The size of the largest 
peak in the cross-correlation function is an indi- 
cation of the quality of the match between the 
galaxy and the template spectrum. The position 
and width of the peak give the redshift and an 
'error' on the redshift. If the galactic and tem- 
plate spectra agreed exactly then a sharp corre- 
lation peak would be found, but in practice it is 
unlikely that the galactic spectrum will exactly 
match any of the template spectra. Depending 
on the size of the mismatch, the redshift may or
may not be correct: the 'error' is merely a measure
of the accuracy of the location of the peak
and not an indication of its true worth. Tonry 
& Davis presented a formulation for the error 
on peak location, which was improved upon by 
Heavens (1993). 

A series of templates consisting of different 
types of galactic spectra, individually tested, is 
not necessarily the optimal template set to use. 
It would be preferable to generalize the concept 
of cross-correlation to use a simultaneous linear 
combination of templates, with expansion coeffi- 
cients that depend on the redshift. With a suit- 
able choice of template spectra, the mismatch 
between the data and a linear combination of a 
small number of template spectra could be re- 
duced to an arbitrarily small amount. Any resid- 
ual would be due only to the random component 
of the observational noise. 

In this paper, the first of a series of papers, 
a method is presented for achieving this. The 
method, which we will call 'PCAZ', is based 
upon the use of Principal Components Analysis 
to make the general linear problem amenable to 
efficient computation. The fundamental math- 
ematics is presented in section 2, and a sim- 
ple demonstration based upon some sample 2dF 
galaxy spectra is shown in section 3. Subsequent 
papers will present in more detail the methods 
of robust error analyses and software for imple- 
menting the PCAZ algorithm. 

2. MATHEMATICS BEHIND PCAZ 
2.1. Standard Cross-Correlation Revisited 



Consider a galaxy spectrum $G_\lambda$ (with normally
distributed errors, variance $\sigma_\lambda^2$) requiring
a redshift z, and a single template spectrum $T_\lambda$.
If both the galactic spectrum and the template
spectrum are binned on the same logarithmic
wavelength grid, the likelihood that the galaxy
and the template are the same, bar the redshift
and normalization, can be written:

$$ L \propto \exp(-\chi^2/2), \qquad \chi^2 = \sum_\lambda \frac{[G_\lambda - a(z)\,T_{\lambda+\Delta}]^2}{\sigma_\lambda^2} \qquad (1) $$

where the sum is over discrete wavelength bins
($\lambda = 1, 2, 3, \ldots$) and $\Delta$ is the linear shift along the
logarithmic grid due to the redshift, $\Delta \propto \log(1+z)$.
$a(z)$ is the redshift dependent coefficient of
the template. At any particular redshift z we
can find the value of $a(z)$ that maximizes the
likelihood (i.e. gives the best match between the
galaxy and the template) by setting $d\chi^2/da = 0$.
Solving this gives:

$$ a(z) = \frac{\sum_\lambda G_\lambda T_{\lambda+\Delta}/\sigma_\lambda^2}{\sum_\lambda T_{\lambda+\Delta}^2/\sigma_\lambda^2} \qquad (2) $$

It can be seen that $a(z)$ in equation 2 is simply
proportional to the cross-correlation function
of the galaxy spectrum with the template spectrum,
for the case where the variance is ignored.
Substituting this value of $a(z)$ into equation 1
and simplifying gives:

$$ \chi^2 = \sum_\lambda \frac{G_\lambda^2}{\sigma_\lambda^2} - a(z)^2 \sum_\lambda \frac{T_{\lambda+\Delta}^2}{\sigma_\lambda^2} \qquad (3) $$

The minimum of $\chi^2$ as a function of redshift occurs
when $a(z)$ is maximized, thus finding the
peak of the cross-correlation function (or CCF) is
exactly equivalent to finding the maximum likelihood
redshift at which the single template best
matches the data (again putting in the variance
introduces complications, see section 2.4).
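
The equivalence between the CCF peak and the minimum of $\chi^2$ can be checked numerically. The following is a minimal sketch, not the code used for this paper: it assumes unit variance, a synthetic random template, and a circular shift (np.roll) in place of zero padding. The names template, galaxy and true_shift are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
template = rng.normal(size=n)        # stand-in for T_lambda (synthetic, an assumption)
true_shift = 17                      # shift Delta in log-wavelength bins

# Galaxy = scaled template such that G_lambda = 2 * T_{lambda+Delta}, plus noise
galaxy = 2.0 * np.roll(template, -true_shift) + 0.1 * rng.normal(size=n)

shifts = np.arange(n)
a = np.empty(n)
chi2 = np.empty(n)
for d in shifts:
    t_shifted = np.roll(template, -d)                      # T_{lambda+Delta}, circular
    a[d] = galaxy @ t_shifted / (t_shifted @ t_shifted)    # equation 2 with sigma = 1
    chi2[d] = np.sum((galaxy - a[d] * t_shifted) ** 2)     # equation 1 with sigma = 1

best_from_chi2 = int(np.argmin(chi2))   # maximum likelihood shift
best_from_ccf = int(np.argmax(a))       # peak of the normalized CCF
```

With a circular shift the template norm is the same at every trial shift, so the two estimates coincide exactly; with zero padding and a wavelength dependent variance this is only approximately true.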

This maximum likelihood basis for cross-correl- 
ation is fundamental to the linear generalization, 
but has not been remarked upon in the astronom- 
ical literature. The approach that has been used 
historically to assign a confidence or quality value 
to the redshift has been based upon the height 
of the CCF peak above the CCF 'noise' (see for
example Heavens 1993). However much of this
'noise' is due to systematic mismatch between
the template and the data rather than observational
noise and thus the assumption that peaks
are uncorrelated is invalid. With the formulation
given above, and realistic errors, it would be pos- 
sible to assign a true likelihood value and hence 
confidence intervals to a peak if the template and 
the galaxy were identical and just differed due to 
the observational noise and the redshift. 

2.2. Linear Generalisation 

The standard cross-correlation method of Tonry 
& Davis tests the candidate galaxy spectrum 
against a range of template spectra individually. 
The linear generalization presented here essentially
assumes that a galaxy spectrum can be expanded
as a linear sum of template spectra. This
in principle allows the systematic mismatch be- 
tween galaxy and template to be arbitrarily re- 
duced and hence a realistic likelihood to be as- 
signed to an output redshift. 

Initially, for simplicity, we will consider how 
one solves for the values of the coefficients at zero 
redshift. We assume a galaxy spectrum is repre- 
sented by an n dimensional vector G, where n 
is the number of wavelength bins. The m tem- 
plate spectra are represented by the rows of an 
m x n matrix T. The galaxy spectrum is then 
fitted by a linear combination of templates with 
coefficients $a_i$:

$$ G_\lambda \approx \sum_i a_i T_{i\lambda} \qquad (4) $$

The coefficients, $a_i$, may be found by following
the same maximum likelihood recipe used above
and minimizing $\chi^2$ where now:

$$ \chi^2 = \sum_\lambda \frac{W_\lambda^2}{\sigma_\lambda^2} \Big[ G_\lambda - \sum_i a_i T_{i\lambda} \Big]^2 \qquad (5) $$

We have now introduced $W_\lambda$ as representing a
general wavelength dependent weighting function
(which might be used, for example, to emphasize
particular spectral features). Setting $\partial\chi^2/\partial a_i = 0$
leads to the matrix equation:

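$$ C\,a = T\,G' \qquad (6) $$

where the elements of the vector G' are $G'_\lambda = W_\lambda^2 G_\lambda/\sigma_\lambda^2$
and the elements of the m x m correlation
matrix, C, are given by:

$$ C_{ij} = \sum_\lambda \frac{W_\lambda^2}{\sigma_\lambda^2}\, T_{i\lambda} T_{j\lambda} \qquad (7) $$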

Direct inversion of the C matrix to obtain the
$a_i$ coefficients is clearly impractical. Not only
would it be numerically intensive to do this at
many trial redshifts, but the presence of very
small eigenvalues (see section 2.3 below) would
lead to large numerical instabilities. However, 
if the template spectra (the rows of the matrix 
T) are replaced by a basis set of orthogonal vec- 
tors, the transformed correlation matrix will be 
diagonal, and the problem simplifies. Principal 
Component Analysis (PCA) is the tool used to 
select the orthogonal vectors. 
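
To illustrate why an orthogonal basis helps, the sketch below compares a direct solution of equation 6 with the same fit carried out in a basis in which the correlation matrix is diagonal. It is a minimal example under stated assumptions: unit weights and variance, zero redshift, and a hypothetical set of six nearly identical synthetic templates. It is not the implementation used for the results in this paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 400
# Hypothetical template set: strongly correlated spectra (rows), shape (m, n)
base = rng.normal(size=n)
T = base + 0.01 * rng.normal(size=(m, n))

a_true = rng.normal(size=m)
G = a_true @ T                       # noiseless galaxy spectrum at zero redshift

# Equations 6 and 7 with unit weights and unit variance
C = T @ T.T                          # m x m correlation matrix
a_direct = np.linalg.solve(C, T @ G)
cond_C = np.linalg.cond(C)           # large when some eigenvalues are very small

# Same fit in an orthogonal basis: C = R diag(evals) R^T, E = R^T T
evals, R = np.linalg.eigh(C)
E = R.T @ T                          # rows are mutually orthogonal
b = (E @ G) / evals                  # one division per coefficient, no matrix inversion
G_fit = b @ E
```

Because the rows of E are orthogonal, the coefficients at each trial redshift decouple, which is what makes the search over many redshifts cheap.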

2.3. Principal Component Analysis 

Principal Component Analysis is a technique
frequently used for data compression and classification
(Kendall & Stuart 1966, Murtagh &
Heck 1987). In particular, direct PCA of spectral
data similar to that used here has been
used for classification of galactic spectra (Mittaz
et al. 1990, Connolly et al. 1995, Folkes et al.
1996, Sodre & Cuevas 1997) and for classification
of QSO spectra (Francis et al. 1992).

In essence, PCA finds the 'best' representa- 
tion of a set of data by a set of orthogonal vec- 
tors, or principal components, which can be com- 
bined linearly to reconstitute the data. The com- 
ponents are ordered in terms of significance in 
a least squares sense and data compression is 
achieved by retaining only the most significant 
principal components. 

PCA can be formulated in two different but
equivalent ways, both of which have been used for
spectral classification. Consider a set of m template
spectra sampled at n discrete wavelengths.
The elements of the matrix T can be pictured 
as a series of row vectors, each of which is a 
point representing a spectrum in n-dimensional 
wavelength space. Alternatively, the data can be 
thought of as column vectors with each point in 
m-dimensional template space being the set of 
fluxes in an individual wavelength bin. A PCA 
in the template space diagonalizes the elements 
of the m x m correlation matrix:

$$ C_{ij} = \sum_\lambda T_{i\lambda} T_{j\lambda} \qquad (8) $$

where we, for now, ignore weights and variance
factors for clarity in the discussion. A PCA in
wavelength space diagonalizes the elements of the
n x n correlation matrix:

$$ C_{\lambda_1\lambda_2} = \sum_i T_{i\lambda_1} T_{i\lambda_2} \qquad (9) $$

The two approaches are equivalent and in prin- 
ciple they will lead to the same eigenvalues and 
principal components that are related by a simple
transformation (Murtagh & Heck 1987). In
many ways the wavelength space is more intuitive
for spectral classification and Mittaz et al.
(1990), Francis et al. (1992), Folkes et al. (1996)
and Sodre & Cuevas (1997) have used a PCA in 
wavelength space for this. Connolly et al. (1995), 
who only used a small number of spectra, chose 
to work with the reduced dimensionality of tem- 
plate space. In order to clearly show the link 
between the cross-correlation method and PCA, 
and because we have fewer template spectra than 
wavelength bins we have chosen to follow Con- 
nolly et al. and perform the diagonalization in 
template space. 

In practice this means taking the set of templates
and constructing from them a set of orthogonal
'eigentemplates'. The matrix C is diagonalized,
to yield a set of eigenvalues:

$$ C = R \Lambda R^T \quad \mathrm{where} \quad \Lambda = \mathrm{diag}(\Lambda_1, \Lambda_2, \ldots, \Lambda_m) \qquad (10) $$

and associated matrix of m-dimensional eigenvectors,
R, which are the principal components
in template space. The diagonalization is accomplished
by standard numerical techniques such
as Singular Value Decomposition (Mittaz et al.
1990). The matrix R defines a transformation
between the template spectra and a set of n-dimensional
orthogonal eigentemplates, the principal
components in wavelength space $E_{i\lambda}$:

$$ E_{i\lambda} = \sum_j R_{ji} T_{j\lambda} \qquad (11) $$

This is essentially the Karhunen-Loève transform
(Murtagh & Heck 1987). The resulting
eigentemplates satisfy the orthogonality property:

$$ \sum_\lambda E_{i\lambda} E_{j\lambda} = \Lambda_i \delta_{ij} \qquad (12) $$

where $\delta_{ij}$ is the Kronecker delta. The eigenvalues
$\Lambda_j$ represent the contribution of each eigentemplate
to the set of templates in a least squares
sense. If the principal components are arranged
in order of decreasing eigenvalue it can be shown
(Kendall & Stuart 1966) that the first principal
component in either space is the line along which 
the cloud of points is the most elongated (has the 
greatest variance). Equivalently, the first princi- 
pal component is the line for which the sum of 
the squared perpendicular distances of the points 
from the line is a minimum. Similarly, if the 
points are projected onto a hyperplane orthogo- 
nal to the first principal component, the second 
principal component is the line in that hyper- 
plane along which the projected distribution is 
most elongated. Representing the data in terms 
of just the first principal component would be 
equivalent to approximating the cloud of points 
by a line and characterizing each point in terms
of its projected distance along the line. Representing
the data in terms of the first two principal
components is equivalent to projecting the cloud 
of points onto a plane. 

The spectra within the template set can be 
represented to any given accuracy by a linear 
combination of eigentemplates:

$$ T_{i\lambda} \approx \sum_{j=1}^{p} b_{ij} E_{j\lambda} \qquad (13) $$

where p is the number of eigentemplates retained.
Since the eigentemplates are orthogonal the corresponding
expansion coefficients are given by:

$$ b_{ij} = \frac{1}{\Lambda_j} \sum_\lambda T_{i\lambda} E_{j\lambda} \qquad (14) $$

where $\Lambda_j$ is the jth eigenvalue derived above.

In practice only a subset of the principal com- 
ponents represent real correlations and anticorre- 
lations between the spectra within the template 
set. The remaining principal components may 
contain a large fraction of uncorrelated noise in 
which case they can be discarded. Folkes et al. 
show that the number of significant principal 
components, p, depends on the quality of the 
template data set. Reconstruction of the tem- 
plate spectra from the first p principal compo- 
nents effectively filters out much of the noise. In 
the case where the input template set consists of 
a few very high signal/noise spectra it may be desirable
to retain all the eigentemplates; in this
case the PCA can be viewed as a shortcut
for speeding up the solving of equation 6 for
a large number of redshifts.
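
The construction of eigentemplates and the truncation to p components can be sketched as follows. This is an illustration under stated assumptions rather than the code used here: weights and variance are ignored, the diagonalization uses numpy.linalg.eigh instead of a singular value decomposition routine, and the template set is a hypothetical mixture of two synthetic spectral shapes plus noise. The function names eigentemplates and reconstruct are our own.

```python
import numpy as np

def eigentemplates(T):
    """Eigenvalues (descending) and eigentemplates E (rows) of a template matrix T (m x n)."""
    C = T @ T.T                          # equation 8, no weights or variance
    evals, R = np.linalg.eigh(C)         # C = R diag(evals) R^T, ascending order
    order = np.argsort(evals)[::-1]      # arrange by decreasing eigenvalue
    evals, R = evals[order], R[:, order]
    E = R.T @ T                          # equation 11: E_i = sum_j R_ji T_j
    return evals, E

def reconstruct(T, evals, E, p):
    """Approximate each template from the first p eigentemplates (equations 13 and 14)."""
    b = (T @ E[:p].T) / evals[:p]        # expansion coefficients b_ij, shape (m, p)
    return b @ E[:p]

# Hypothetical template set: two underlying shapes mixed in 8 noisy spectra
rng = np.random.default_rng(2)
n = 300
x = np.linspace(0.0, 1.0, n)
shapes = np.vstack([np.sin(12 * x), np.exp(-((x - 0.5) / 0.02) ** 2)])
mix = rng.uniform(0.2, 1.0, size=(8, 2))
T = mix @ shapes + 0.01 * rng.normal(size=(8, n))

evals, E = eigentemplates(T)
T_filtered = reconstruct(T, evals, E, p=2)
frac_in_first_two = evals[:2].sum() / evals.sum()
```

In this synthetic case nearly all of the summed eigenvalue is carried by the first two components, and the remaining six contain mostly noise, so the p = 2 reconstruction acts as a noise filter.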

To apply PCA to redshift determination it is
necessary to assume that the template set is sufficiently
general that any galactic spectrum not
included in the original template set can also be
represented to the required accuracy by a summation
over the first p principal components. Essentially
we are assuming that the correlations
within the template set reflect a global correlation
across all galaxies in the survey. Allowance
for abnormal objects such as stars, active galaxies
and quasars can be made by including example 
spectra of these in the template set or by discrim- 
inating against bad matches (see Section 3.2). 

2.4. Relation to Cross-Correlation and
Redshift Determination 

The discussion of PCA above is general and
up to this point follows the spirit of the spectral
classification of Mittaz et al., Francis et al., Connolly
et al. and Folkes et al. The extra step is
to include the redshift z as an additional variable.
Weighting can be retained, but must be
tied to the rest frame of the templates, and the variance
must be assumed wavelength independent
(but see below). The coefficients of the eigentemplates
then become:



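$$ b_j(z) = \frac{1}{\Lambda_j} \sum_\lambda G_\lambda\, w_{\lambda+\Delta} E_{j,\lambda+\Delta} \qquad (15) $$

where each $b_j(z)$ is the cross-correlation function
of the galaxy spectrum with the jth eigentemplate
weighted by the corresponding eigenvalue.
If $g_k$ is the discrete Fourier transform of
the galaxy spectrum $G_\lambda$ and $e_{jk}$ is the discrete
Fourier transform of $w_\lambda E_{j\lambda}$ then the coefficients
$b_j(z)$ are given by the inverse Fourier transform
of the product of $g_k^{*}$ (the complex conjugate of $g_k$) and $e_{jk}$,
where N is the length of the transformed series:

$$ b_j(z) = \frac{1}{N \Lambda_j} \sum_k g_k^{*}\, e_{jk} \exp\left(\frac{2\pi i k \Delta}{N}\right) \qquad (16) $$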

The orthogonality gives the simple relation for
the joint likelihood:

$$ \chi^2 = \frac{1}{\sigma^2} \Big[ \sum_\lambda w_{\lambda+\Delta} G_\lambda^2 - \sum_{j=1}^{p} \Lambda_j\, b_j^2(z) \Big] \qquad (17) $$

where the variance is $\sigma^2$. The minimum of equation
17 gives the maximum likelihood redshift, z,
through:

$$ z = 10^{\Delta_{\min}\,\delta(\log\lambda)} - 1 \qquad (18) $$

where $\Delta_{\min}$ is the shift that gives the minimum
value of $\chi^2$ and $\delta(\log\lambda)$ is the spacing of the
logarithmic wavelength grid. Note that the single cross-correlation
function in equation 3 has been replaced in equation
17 by a weighted sum of the squares of the
individual cross-correlation functions. This is a
natural result given that the eigentemplates are
orthogonal.
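
The computation described in this section can be sketched in a few lines. The example below is illustrative only and rests on several assumptions: unit weights and unit variance, synthetic random eigentemplates, and the convention that galaxy bin l is compared with eigentemplate bin l + Delta, so that shifts past the midpoint of the padded series are reported as negative lags. The names pcaz_chi2 and shift_to_redshift are hypothetical and do not refer to the PCAZ software.

```python
import numpy as np

def pcaz_chi2(G, E, evals):
    """chi^2 at every integer lag for galaxy G and orthogonal eigentemplates E (rows).

    evals[j] must equal the sum of squares of row j of E. Unit weights and variance.
    """
    n_g, n_e = G.size, E.shape[1]
    N = 1 << (n_g + n_e).bit_length()                 # power of 2 greater than the summed lengths
    g = np.fft.rfft(G, n=N)
    e = np.fft.rfft(E, n=N, axis=1)
    # ccf[j, d] = sum_l G[l] * E[j, (l + d) mod N], all lags from one inverse FFT
    ccf = np.fft.irfft(np.conj(g) * e, n=N, axis=1)
    b = ccf / evals[:, None]                          # expansion coefficients b_j
    chi2 = np.sum(G ** 2) - np.sum(evals[:, None] * b ** 2, axis=0)
    lags = np.arange(N)
    lags[lags >= N // 2] -= N                         # report wrap-around lags as negative
    return lags, chi2, b

def shift_to_redshift(delta_min, dlog10_lambda):
    """Convert a shift in bins to a redshift on a log10 wavelength grid (equation 18)."""
    return 10.0 ** (delta_min * dlog10_lambda) - 1.0

# Synthetic test: three orthogonal eigentemplates with chosen eigenvalues
rng = np.random.default_rng(3)
n_e, n_g, s = 200, 300, 30
Q, _ = np.linalg.qr(rng.normal(size=(n_e, 3)))        # orthonormal columns
evals = np.array([5.0, 2.0, 1.0])
E = (Q * np.sqrt(evals)).T                            # rows with squared norms evals
c = np.array([1.0, -0.5, 0.3])
G = np.zeros(n_g)
G[s:s + n_e] = c @ E                                  # G[l] = S[l - s], i.e. lag Delta = -s

lags, chi2, b = pcaz_chi2(G, E, evals)
best = int(np.argmin(chi2))
```

In this noise-free case the minimum of chi2 falls at the lag where the galaxy is an exact linear combination of the shifted eigentemplates, and the coefficients b at that lag reproduce the input values c. The total cost is one forward transform per eigentemplate and one inverse transform per cross-correlation function.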

To preserve the orthogonality of templates 
with redshift it is not possible to weight by a 
wavelength dependent variance, because the er- 
rors will be tied to the observed frame. For 
a strongly wavelength dependent variance the 
method still gives the optimal fit in a least 
squares sense by minimizing the function: 

$$ f = \sum_\lambda \Big[ G_\lambda - \sum_i b_i(z) E_{i,\lambda+\Delta} \Big]^2 \qquad (19) $$

One can then reintroduce $\sigma_\lambda$ for a final pass and
calculate a true likelihood for the final redshift,
though this may not be the absolute maximum
likelihood because it is not used in the calculation
of the $b_j$'s. In practice this will mean a few more
objects may fail to have their redshift determined 
within specified likelihood bounds. 

It should be noted that the logarithmic wave- 
length scale used for redshift determination gives 
the correct weighting of spectral features. Since 
$\Delta\lambda \ll \lambda$ then $\log(\lambda+\Delta\lambda) - \log(\lambda) \approx \Delta\lambda \log(e)/\lambda$,
so the fractional wavelength range per bin is constant
across the spectrum. The important features
in the eigenfunctions are the spectral lines
which should be equally weighted. For a classical
grating, $\Delta\lambda \propto \lambda$ for unresolved lines, and for
Doppler broadening the same holds true. Thus 
logarithmic binning gives correct equal weight- 
ing of features. Of course most real systems give 
close to a linear wavelength scale, so the spectra 
must be resampled to logarithmic bins which will 
introduce covariance between neighboring pixels. 
However this will be very small as the wavelength 
scale only changes very slowly across the spec- 
trum. 
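
A resampling step of this kind might look like the sketch below. It assumes simple linear interpolation, which does not conserve flux exactly and is used here only for illustration; the 1024-pixel linear grid and the function name rebin_log are hypothetical.

```python
import numpy as np

def rebin_log(wave, flux, dlog10):
    """Linearly interpolate a spectrum onto a grid uniform in log10(wavelength)."""
    log_w = np.log10(wave)
    n_bins = int(np.floor((log_w[-1] - log_w[0]) / dlog10)) + 1
    log_grid = log_w[0] + dlog10 * np.arange(n_bins)
    return 10.0 ** log_grid, np.interp(log_grid, log_w, flux)

# Hypothetical linear wavelength scale spanning 3810 to 8227 Angstroms
wave = np.linspace(3810.0, 8227.0, 1024)
flux = np.ones_like(wave)
log_wave, log_flux = rebin_log(wave, flux, dlog10=1.7e-4)

# Each log bin spans the same fractional wavelength range
ratio = log_wave[1:] / log_wave[:-1]
```

On such a grid a redshift is a pure translation by a fixed number of bins, whatever the rest wavelength of the feature.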

The PCAZ method has numerous advantages 
over previous methods in the literature:

1. Because it is just a set of cross-correlations
the standard Fast Fourier Transform (FFT)
method can be used to efficiently compute
the $b_j(z)$'s. The simultaneous combination
of m eigentemplates takes the same
computer time as doing m templates separately.

2. Existing cross-correlation codes can be used
with little modification. They only need to 
be provided with orthogonalized eigentem- 
plates instead of the normal templates as 
inputs and have some provision made for 
combining the cross-correlation functions 
in quadrature afterwards. 

3. Emission line galaxies are easily handled 
by PCAZ. The standard cross-correlation 
method gives relatively poor results for 
these because emission line ratios vary much 
more than absorption lines and hence can 
no longer be accounted for by a small num- 
ber of standard galaxy spectra. However, 
with the extra freedom given by a linear 
combination of eigentemplates variable line 
ratios can be accommodated. This freedom 
means the method is robust against other 
wavelength-dependent variations such as 
only very approximate, or the absence of, 
flux calibration of the input spectra. 

4. High signal/noise (S/N) eigenspectra can
be created from a large set of noisy data
as well as from a small set of high S/N spectra
because each eigenspectrum represents an
average of that mode over the data. This
would be especially suitable for a deep red- 
shift survey where many of the weak ultra- 
violet absorption features would be miss- 
ing from local templates. A few hundred 
high redshift galaxies could have their red- 
shifts measured manually. Eigenspectra 
constructed from these could be used to 
measure the rest automatically. 

5. The ability to calculate a likelihood means
a true confidence could be assigned to a
redshift and future science analyses of sur- 
vey statistics such as the power spectrum 
of galaxy clustering P(k) could include a 
realistic probability distribution of redshift 
errors rather than neglecting them. This 
is especially important for the next gener- 
ation of very large surveys. 

6. The maximum likelihood reconstruction from
the coefficients $b_j E_{j\lambda}$ is a noise filtered version
of the data, which is useful for other
analyses. 

7. The coefficients $b_j$ have independent
errors and could be used as the basis for a
classification scheme for faint spectra, either
by themselves or as input into other
systems such as Artificial Neural Net algorithms
(e.g. Folkes et al. 1996).



8. The provision of weights allows templates 
to be defined only in regions of interest, for 
example around strong lines. This would, 
for example, be particularly suitable for 
very faint low S/N data where one might 
wish to search for weak emission lines appearing
above the noise. With weights the
rest of the noisy, possibly undetected, continuum
can be excluded from the $\chi^2$.

2.5. Practicalities 

There are a number of important practicali- 
ties involved in using the PCA formalism to de- 
termine redshifts. The first is the issue of mean 
subtraction. It is usual in PCA to subtract the 
mean of the distribution from each point; in the
case of spectral classification the mean spectrum
is subtracted from each of the template spectra
prior to orthogonalization. This is equivalent to 
moving the origin of the PCA co-ordinate system 
to the center of the distribution of points. How- 
ever, strictly a redshifted mean spectrum should 
also be subtracted from the candidate spectrum 
whose redshift is not yet known. Because of this,
the mean spectrum was not subtracted prior to
orthogonalization. 

A second important point is continuum subtraction.
Spectral classification schemes have
avoided continuum subtraction in order to retain
as much spectral information as possible (Mittaz
et al. 1990, Connolly et al. 1995, Folkes et al.
1996). However, continuum subtraction is more
important for redshift determination. Continuum
subtraction reduces the smoothly varying
background to zero and essentially has the same
effect as filtering out the long period Fourier components
of the spectra. Without continuum subtraction
the cross-correlation functions show a
broad peak representing the cross-correlation of 
the two apodized continua, with a small spectral 
cross-correlation peak superimposed. 

A final practicality is the normalization of the 
template spectra. Francis et al., Folkes et al. and 
Sodre & Cuevas normalize to unit flux: 



$$ \sum_\lambda T_{i\lambda} = 1 \qquad (20) $$

The alternative is to normalize to unit scalar
product (Connolly et al. 1995):

$$ \sum_\lambda T_{i\lambda}^2 = 1 \qquad (21) $$



With continuum subtraction the resulting spec- 
tra oscillate about zero so normalization to unit 
scalar product was used. 

3. EXAMPLES 

In this section the method is illustrated us- 
ing a set of sample sky subtracted spectra. The 
method was developed using test spectra from a
variety of sources; we choose to illustrate its effectiveness
here using some early data recently
taken from the 2dF galaxy survey for which the
algorithm is being developed. The 2dF survey is
more comprehensively described elsewhere: see
Taylor et al. (1994, 1997) for a description of the
2dF instrument and Maddox et al. (1997) for an
introduction to the galaxy survey. The data described
here consist of two test fields, SGP463
and NGP359, taken during 2dF commissioning
for the survey in January-April 1997. The galaxies
are selected from the APM survey (Maddox
et al. 1990) with $b_J < 19.7$.



The 2dF spectra spanned a wavelength range
of 3810Å to 8227Å with a 2 pixel resolution of
around 8.4Å (line full-width half-max). Two
fields were considered, one as template spectra
and one as candidate spectra whose redshift was
to be determined. Redshifts had been previously
assigned to both data sets by visual inspection
(M. Colless and K. Glazebrook private communication).
This gives a typical accuracy of
$\Delta z \sim 0.0005$ set by the spectral resolution. The
template field contained a total of 91 galaxies for
which redshifts had been assigned and the candidate
field contained 104 galaxies with known
redshifts. The typical signal/noise of the continuum
was 10-30 at 5500Å which should be typical
for the survey spectra. These are quite high signal/noise
and we expect a variety of methods to
work well; in the analysis below we add artificial
noise to degrade the spectra to test robustness of
the method.

3.1. Eigenspectra 

Two sets of eigenspectra were constructed.
The first used five high signal/noise template
spectra taken from an atlas of integrated spectra
of local galaxies (Kennicutt 1992). The five
spectra chosen are listed in Table 1. They cover
a wavelength range of 3600Å up to 7050Å. The
spectra were rebinned on a log wavelength grid
with a grid spacing of $\delta\log_{10}(\lambda/\mathrm{Å}) = 1.7 \times 10^{-4}$.
The second set of spectra was derived from
the 2dF data itself. The NGP359 field was
used. The 91 spectra with well determined redshifts
were corrected for redshift and used as the
template set. A wavelength grid of 3100Å to
7007Å was used for these with a grid spacing of
$\delta\log_{10}(\lambda/\mathrm{Å}) = 1.8 \times 10^{-4}$.

The template spectra were continuum subtracted
and normalized prior to orthogonalization.
For simplicity, a constant variance and unit
weights were assumed at all wavelengths. The 
2dF spectra were fluxed using an approximate 
mean 2dF response curve derived from photomet- 
ric standards. Continuum subtraction was done 
at each point by subtracting the local median 
calculated over a 100-bin wide window centered 
at that point. The continuum subtracted spectra 
were normalized so that the sum of the squares of 
fluxes in the continuum subtracted spectrum was 
unity. With this normalization and unit weights
the first term on the right hand side of equation
17 equals one, leading to a particularly simple
expression for $\chi^2$.
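
The continuum subtraction and normalization just described can be sketched as follows. The treatment of the ends of the spectrum (repeating the edge value) and the exact centering of the even-width window are assumptions made for this illustration, and the function names are our own.

```python
import numpy as np

def continuum_subtract(flux, window=100):
    """Subtract the running median over `window` bins centered (to within a bin) on each point."""
    half = window // 2
    padded = np.pad(flux, half, mode="edge")          # assumption: repeat edge values
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)[: flux.size]
    return flux - np.median(windows, axis=1)

def normalize_unit_scalar_product(spec):
    """Scale so that the sum of the squares of the fluxes is unity (equation 21)."""
    return spec / np.sqrt(np.sum(spec ** 2))

# Hypothetical spectrum: smooth continuum plus a single narrow emission line
x = np.arange(1000)
flux = 10.0 + 0.005 * x
flux[500] += 5.0
spec = normalize_unit_scalar_product(continuum_subtract(flux))
```

A median is used rather than a mean so that narrow lines have little influence on the continuum estimate; the line survives the subtraction while the smooth background is reduced to nearly zero.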

The resulting normalized spectra were orthogonalized
using a standard singular value decomposition
routine. A selection of the resulting
eigenspectra are shown in Figures 1 and 2.

The five eigenfunctions derived from the Kennicutt
spectra are shown in Figure 1. Figure 2
shows the 2dF eigenspectra with the five highest
eigenvalues. Orthogonalization of 91 spectra
leads to 91 eigenspectra but as discussed in section
2 many of these represent noise. Five 2dF
eigentemplates were retained for the redshift determination
and three Kennicutt eigentemplates.
It can be seen from Figure 1 that the first two
eigenfunctions, derived from the different data
sets, are very similar. These two account for more than 80%
of the variation in the input data. The higher
different data sets, which is to be expected given 
the effect of noise on the exact location of the 
Principal Components. 

3.2. Redshift Determination 

Redshifts were calculated using both the Kennicutt
eigenspectra shown in Figure 1 and the
2dF eigenspectra shown in Figure 2. As most
spectral information is contained in the highest 
eigenvalue eigenspectra only the first three Ken- 
nicutt eigenspectra and the first five 2dF eigen- 
spectra were retained. 



The 2dF spectra showed a number of residual 
sky features in the regions of strong atmospheric 
emission and absorption lines. Where these are 
the strongest features in the spectrum there is 
a danger that the correlation between the strong 
peaks in the eigenspectra (particularly the strong 
Hα line) and the sky residuals will be greater
than the correlation between the templates and
the much weaker galaxy spectrum. As a preprocessing
step before orthogonalization sky residuals
were removed in 60Å bands around 5577,
5892, 6300, 6363 and 7610Å. The missing spectral
bands were interpolated using a least squares
fit to the spectrum on either side and the spectra
were rebinned onto the same wavelength grid as 
the eigenspectra. 
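
A possible form of the sky residual removal is sketched below. The order of the fit (a straight line) and the width of the side regions used for the fit (60Å on either side) are assumptions chosen for this illustration, since only the band centers and the 60Å band width are specified above. The function name patch_sky_bands is hypothetical.

```python
import numpy as np

SKY_CENTRES = (5577.0, 5892.0, 6300.0, 6363.0, 7610.0)   # Angstroms, from the text

def patch_sky_bands(wave, flux, centres=SKY_CENTRES, width=60.0, side=60.0):
    """Replace each sky band by a least squares straight line fitted to its clean neighbors."""
    bad_all = np.zeros(wave.size, dtype=bool)
    for c in centres:
        bad_all |= np.abs(wave - c) <= width / 2
    out = flux.copy()
    for c in centres:
        bad = np.abs(wave - c) <= width / 2
        # Side regions, excluding pixels that belong to any sky band
        near = (np.abs(wave - c) <= width / 2 + side) & ~bad_all
        if bad.any() and near.sum() >= 2:
            slope, intercept = np.polyfit(wave[near], flux[near], 1)
            out[bad] = slope * wave[bad] + intercept
    return out

# Hypothetical spectrum: smooth ramp with strong spikes at the sky line positions
wave = np.arange(3810.0, 8227.0, 4.0)
truth = 1.0 + 1e-4 * wave
flux = truth.copy()
for c in SKY_CENTRES:
    flux[np.abs(wave - c) < 5] += 50.0
patched = patch_sky_bands(wave, flux)
```

Pixels belonging to any band are excluded from every fit, which matters for the adjacent 6300Å and 6363Å bands.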

The rebinned spectra were continuum sub- 
tracted and normalized in the same way as the 
template spectra. The expansion coefficients,
$b_j(z)$, can be quickly and efficiently found using
fast Fourier transforms. The FFT algorithms are
most efficient if the total length of the series, N,
is equal to a power of 2. In addition, because
the FFT treats the series as a periodic function
of period N, N must be greater than the sum of
the length of the galactic and template spectra to
avoid wrap-around errors in the cross-correlation calculation.
To this end, both template and galactic spectra
were zero-padded to the next power of 2 greater than
the sum of their lengths.
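
The effect of the padding can be demonstrated with a short sketch. The arrays are random stand-ins for a galaxy and a template spectrum, and the helper names are hypothetical; the point is only to compare the FFT result with a direct, non-periodic sum.

```python
import numpy as np

def next_pow2_above(total):
    """Smallest power of 2 strictly greater than `total` (a positive integer)."""
    return 1 << int(total).bit_length()

def fft_ccf(g, t, N):
    """Circular cross-correlation sum_l g[l] * t[(l + lag) mod N] for lag = 0..N-1."""
    return np.fft.irfft(np.conj(np.fft.rfft(g, N)) * np.fft.rfft(t, N), N)

def direct_ccf(g, t, lag):
    """Non-periodic sum over the bins where both series are defined."""
    l = np.arange(g.size)
    k = l + lag
    ok = (k >= 0) & (k < t.size)
    return np.sum(g[l[ok]] * t[k[ok]])

rng = np.random.default_rng(4)
g = rng.normal(size=300)      # stand-in galaxy spectrum
t = rng.normal(size=200)      # stand-in template spectrum

N = next_pow2_above(g.size + t.size)
padded = fft_ccf(g, t, N)             # positive lags at index lag, negative at N + lag
unpadded = fft_ccf(g, t, g.size)      # period too short: lags +150 and -150 are mixed
```

With sufficient padding every lag of the FFT result equals the direct sum. With a period of only 300 bins, the value at lag 150 is the sum of the true values at lags +150 and -150, which is the kind of error the padding avoids.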

To illustrate the procedure, Figure 3 shows a
selection of six of the input spectra. They have
been corrected for the sky residuals but not yet
continuum subtracted or normalized. The results
are discussed using the Kennicutt eigenspectra.
Figure 4 shows the corresponding $\chi^2$ functions
obtained. Calculated and manual redshifts are
given in Table 2 along with the associated expansion
coefficients.

Spectra (a) to (d) are typical of the majority
of the spectra studied. They give calculated redshifts
that agree well with the manual redshifts.
The corresponding $\chi^2$ functions show clear minima
giving an unambiguous determination of the
redshift. Spectrum (e) is noisier and had no manual
redshift assigned to it previously; however,
the $\chi^2$ function gives a clear, albeit weaker, minimum
at a redshift of 0.06. Spectrum (f) is the spectrum
of a quasar included as a deliberate outlier.
The method clearly fails to find a redshift for this
spectrum, as expected since there are no quasar
spectra in the template set. There will always
be a minimum value of $\chi^2$ but it is clear from
inspection of the corresponding $\chi^2$ function that
the associated redshift estimate is unreliable.

The failure of the method to find a redshift for 
the quasar illustrates the importance of including 
all spectral types of interest in the template set. 
The method will fail to find a redshift for galax- 
ies whose spectra differ fundamentally from the 
template set (e.g. in instrumental resolution). 
However, in principle the method will work for
all spectral types, including emission line galaxies,
provided that the relevant spectral types lie
within the m-dimensional space spanned by the 
eigenspectra. The power of the PCA method 
lies in its ability to reduce the dimensionality of 
linear problems from many templates to a few 
eigenspectra with no loss of accuracy, and its re- 
sulting ability to filter out the noise from noisy 
templates. 

A side effect of this method is the ability to 
reconstruct 'filtered' versions of the spectra from 
the eigentemplates. With only a few eigentem- 
plates, the relative strength of the emission and 
absorption lines may not have fully converged, 
but a comparison of the reconstructed and orig- 
inal spectra helps to clarify how the method 
works. Figure 5 shows the reconstructed
spectra corresponding to the first five original spectra
shown in Figure 3. No reconstruction is given for
the quasar spectrum.
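
The reconstruction step can be sketched as a projection onto the
orthonormal eigenspectra followed by a weighted sum. The example below
is illustrative only: the function `reconstruct`, the toy basis built
with a QR decomposition, and the noise level are assumptions, and the
spectrum is taken to be already shifted to the rest frame.

```python
import numpy as np

def reconstruct(spectrum, eigenspectra):
    """Project a spectrum onto orthonormal eigenspectra and rebuild it."""
    b = eigenspectra @ spectrum      # expansion coefficients b_i
    return b @ eigenspectra, b       # filtered spectrum, coefficients

rng = np.random.default_rng(1)
n_pix = 400
x = np.linspace(0.0, 1.0, n_pix)
# Three toy orthonormal "eigenspectra" from a QR decomposition.
raw = np.vstack([np.cos(3 * x),
                 np.sin(9 * x),
                 np.exp(-((x - 0.6) / 0.03) ** 2)])
q, _ = np.linalg.qr(raw.T)           # columns of q are orthonormal
e = q.T

true = 1.2 * e[0] + 0.3 * e[1] - 0.1 * e[2]
noisy = true + 0.05 * rng.normal(size=n_pix)
filtered, b = reconstruct(noisy, e)

err_noisy = np.sqrt(np.mean((noisy - true) ** 2))
err_filtered = np.sqrt(np.mean((filtered - true) ** 2))
print("rms error before and after filtering:", err_noisy, err_filtered)
```

Only the part of the noise that lies within the space spanned by the
eigenspectra survives the projection, which is why the reconstruction
is smoother than the input.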

Of course PCA will fail to reduce the problem
space in non-linear cases. A practical example
might be a sample of AGN with very
broad lines covering a large continuous range in
velocity. However, for redshift work most galaxy
spectra are unresolved, or only marginally
resolved, in which case the variation can be
accommodated in the eigenspectra.

The ultimate test of the method comes with
a larger scale comparison of the manually
determined redshifts and the calculated redshifts.
Figure 6 shows the comparison between the
manually determined redshifts and the two sets of
PCAZ redshifts calculated with the two sets of
eigenspectra. It is clear that the agreement
between the PCAZ redshifts and the manually
determined redshifts is very good for this field,
with a success rate greater than 98%. The 2dF
eigenfunctions performed the best, giving only
one mismatch, at a redshift of 0.23. Clearly we
need somewhat more than 104 spectra to
determine the error rate at this high level of success;
something like 1000-2000 spectra are needed.
We will look at this in more detail in Paper II.
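
The need for a larger sample can be illustrated with a simple binomial
estimate. This is a sketch under stated assumptions, not a calculation
from the paper: the 2% failure rate is a nominal value suggested by the
quoted success rate, and `failure_rate_error` is a hypothetical helper.

```python
import math

def failure_rate_error(p_fail, n):
    """Binomial standard error on a failure fraction measured from n spectra."""
    return math.sqrt(p_fail * (1.0 - p_fail) / n)

for n in (104, 1000, 2000):
    err = failure_rate_error(0.02, n)
    print(f"N = {n:5d}: failure rate 2% +/- {100 * err:.2f}%")
```

With 104 spectra the uncertainty (about 1.4 percentage points) is
comparable to the failure rate itself, whereas with 2000 spectra it
falls to about 0.3 percentage points.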

Poor sky subtraction remains a possible source
of error in the automatic redshift determination.
The PCAZ method took less than two minutes
of computer time to calculate the 104 redshifts.
The measured scatter of the points about the line is
Δz < 0.0005, which is what we expect from the
instrumental resolution.

With the PCAZ code it is trivial to turn off the
steps of orthogonalization and quadratic
combination of cross-correlation functions; this
enables us to reproduce the results of simple CCF
analysis with the same template set. This is also
shown in Figure 6, where the Kennicutt
template with the highest CCF peak gives the CCF
redshift. It can be seen that for these high
signal/noise spectra the results are similar whether
or not the templates are diagonalized. This
simply reflects the excellent quality of the 2dF
spectra, with highly significant features for the
algorithms to select. We anticipated that the PCAZ
method would perform better than simple CCF
for lower signal/noise spectra (much of the initial
testing was done with such spectra before we had
access to 2dF data). To demonstrate this we add
artificial Gaussian noise to the 2dF data (both
data and templates), decreasing the continuum
signal/noise by a factor of 3 so that the galaxies are
typically S/N = 3-10, and repeat our analyses.
Rerunning the PCA analysis gives virtually
identical 2dF eigenfunctions; the redshift results are
shown in Figure 7. It is evident that PCAZ still
performs at the 98% level while the CCF method
has dropped to a 93% success rate. These results
were obtained with minimal manual intervention
and illustrate that PCAZ is more robust in the
low signal/noise regime.
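
One way to degrade the signal/noise by a fixed factor is sketched below.
The text does not state how the noise amplitude was chosen, so the
quadrature rule used here, the function `degrade_snr`, and the toy flat
continuum are assumptions for illustration only.

```python
import numpy as np

def degrade_snr(flux, sigma, factor, rng):
    """Add Gaussian noise so the per-pixel noise grows by `factor`.

    Independent noise adds in quadrature, so the extra noise needed is
    sigma * sqrt(factor**2 - 1).
    """
    sigma_add = sigma * np.sqrt(factor ** 2 - 1.0)
    noisy = flux + sigma_add * rng.normal(size=flux.shape)
    return noisy, sigma * factor

rng = np.random.default_rng(3)
n_pix = 20000
continuum = 10.0
sigma0 = 0.5                          # original per-pixel noise
flux = continuum + sigma0 * rng.normal(size=n_pix)

noisy, sigma1 = degrade_snr(flux, sigma0, factor=3.0, rng=rng)
snr_before = continuum / np.std(flux)
snr_after = continuum / np.std(noisy)
print("continuum S/N before and after:", snr_before, snr_after)
```

The same routine would be applied to both the test spectra and the
spectra used to build the eigenfunctions if both are to be degraded, as
described above.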

Figure 8 shows four of the spectra where the
non-orthogonal cross-correlation method fails,
labeled (a) to (d) in Figure 7. The lower curve
in each panel shows the original 2dF spectrum.
The top curve shows the spectrum plus added
noise. The central curve is the PCA
reconstruction of the noisy spectrum. Figure 9 shows the
corresponding cross-correlation functions and χ²
functions for the four spectra. The PCAZ results
were calculated using the first three Kennicutt
eigenfunctions derived from the same 5
Kennicutt spectra used for the non-orthogonal
cross-correlation method. For the spectra (a) to (c)
the PCAZ method correctly locates the redshift
of the noisy spectra. The corresponding
cross-correlation functions clearly show a peak at the
same redshift, but the noise peaks are as large.
PCAZ simultaneously uses many templates,
effectively averaging over the CCF noise.
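
The difference between the two approaches can be sketched as follows.
This is a toy illustration consistent with the description of a
quadratic combination of cross-correlation functions, not the PCAZ
implementation: the function names, the circular FFT correlation, the
synthetic line spectra, and the log-wavelength step `dloglam` are all
assumptions.

```python
import numpy as np

def ccf(spectrum, template):
    """Circular cross-correlation via FFT; index k is the pixel shift."""
    fs = np.fft.rfft(spectrum)
    ft = np.fft.rfft(template)
    return np.fft.irfft(fs * np.conj(ft), n=spectrum.size)

def pcaz_shift(spectrum, eigenspectra):
    """Combine the CCFs with orthonormal eigenspectra in quadrature."""
    b = np.array([ccf(spectrum, e) for e in eigenspectra])  # b_i(shift)
    chi2 = np.sum(spectrum ** 2) - np.sum(b ** 2, axis=0)
    return int(np.argmin(chi2)), chi2

def best_ccf_shift(spectrum, templates):
    """Simple CCF: take the highest peak over all individual templates."""
    c = np.array([ccf(spectrum, t) / np.linalg.norm(t) for t in templates])
    return int(np.unravel_index(np.argmax(c), c.shape)[1])

n = 1024
pix = np.arange(n)

def lines(centres, amps, width=3.0):
    """Toy spectrum: Gaussian emission (+) and absorption (-) lines."""
    return sum(a * np.exp(-0.5 * ((pix - c) / width) ** 2)
               for c, a in zip(centres, amps))

templates = np.vstack([lines([200, 350, 600], [1.0, -0.5, 0.8]),
                       lines([200, 420, 700], [-0.7, 1.0, 0.4])])
# Orthonormal eigenspectra from the template set.
_, _, vt = np.linalg.svd(templates, full_matrices=False)
eig = vt[:2]

rng = np.random.default_rng(2)
true_shift = 37
spec = np.roll(0.6 * templates[0] + 0.4 * templates[1], true_shift)
spec = spec + 0.02 * rng.normal(size=n)

shift, chi2 = pcaz_shift(spec, eig)
dloglam = 1.0e-4                      # assumed log10(wavelength) pixel step
z = 10.0 ** (shift * dloglam) - 1.0
print("recovered shift:", shift, "redshift:", z)
```

On a uniform log-wavelength grid a redshift is a constant pixel shift,
which is why a shift search suffices. Real spectra would also need
padding or apodization to avoid the wrap-around that the circular
correlation allows in this toy case.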

The fourth spectrum shows a case where both
methods fail. The correct redshift is 0.115, but
the presence of sharp noise spikes, especially at
3950 Å, has introduced spurious correlations.

As surveys progress to thousands and tens of 
thousands of galaxies we expect this relative ad- 
vantage to increase: the derived eigenfunctions 
will include more subtle natural variations in the 
range of galaxy spectral features and will average 
over larger numbers of galaxies. 

4. CONCLUSIONS 

A new method of automatic redshift
determination has been developed and shown to be
capable of reproducing manually determined
redshifts with a minimal amount of manual
intervention. The method is a superior
generalization of cross-correlation and has the potential to
provide a sounder mathematical basis for
confidence in the final redshifts. The expansion
coefficients generated can be used to reconstruct
noise filtered versions of the spectra and have 
the potential to be used for a basic classifica- 
tion of the spectra. The method proves more 
robust in the low signal/noise regime than
independent cross-correlation and has greater
potential for very high success rates in upcoming very
large redshift surveys.

This concludes the introduction and
illustration of the mathematical principles behind PCAZ.
In Paper II in this series we will look in
more detail at the reliability and robustness
of the method with much larger data sets,
and we will consider in detail the treatment of
data with realistic errors and the robustness with
signal/noise, and compute typical probability
distributions for redshift errors from PCAZ. We will
also examine, via simulations, how this affects
the measurement of derived bulk galaxy
properties from very large redshift surveys, such as P(k)
and the galaxy luminosity function.

Fig. 1. — Eigenfunctions obtained using the
Kennicutt spectra listed in Table 1. The vertical
scale is the flux per unit wavelength in
normalized units.

Fig. 2. — First five eigenfunctions obtained using 
a sample of 91 2dF spectra. 

Fig. 3. — The six 2dF spectra discussed in the 
text. The spectra have been corrected for the sky 
residuals and divided by the instrument response 
function. 

Fig. 4. — The χ² functions (multiplied by a
constant variance) corresponding to the spectra
discussed in the text. Note the different vertical
scales on the 6 spectra.

Fig. 5. — Noise-filtered reconstructions of the five
spectra (a) to (e). The spectra are reconstructed
from the first three Kennicutt eigenfunctions.

Fig. 6. — Comparison with manual redshifts in 
the SGP463 2dF field for the three automated 
methods discussed in the text: (a) PCAZ red- 
shifts determined using the eigenfunctions de- 
rived from the NGP359 2dF field, (b) PCAZ 
redshifts determined using the eigenfunctions de- 
rived from the Kennicutt templates, (c) simple 
cross-correlation with the Kennicutt templates, 
picking the best peak. 

Fig. 7. — As Figure 6, this time with the
continuum signal/noise degraded to the range 3-10 for
both the test data (NGP351) and the 2dF field
(SGP463) used to construct the eigenfunctions in
(a).

Fig. 8. — Noise degraded spectra and
reconstructions corresponding to points a-d in Figure 7.
The top curve in each panel is the noise degraded
input spectrum, the second curve is the PCA
reconstruction from the first three Kennicutt
eigenfunctions. The lowest curve is the original 2dF
spectrum from which the input spectrum was
derived.

This 2-column preprint was prepared with the AAS LaTeX
macros v4.0.

Fig. 9. — Cross-correlation functions from a
simple non-orthogonalized cross-correlation
approach and χ² from PCAZ for the spectra in
Figure 8. The five individual cross-correlation
functions for each spectrum are plotted on the
same graph.

Table 1: Galactic spectra (Kennicutt 1992) used to construct the first set of eigenspectra

Galaxy     Morphology
NGC3379    E0
NGC4889    E4
NGC5248    Sbc
NGC2276    Sc
NGC4485    Sm/Im

Table 2: Redshifts and expansion coefficients for spectra (a) to (e)

Spectrum   PCAZ       Visual Inspection   b_1(z)   b_2(z)   b_3(z)
           Redshift   Redshift
a          0.0674     0.0676              1.193    0.025     0.029
b          0.1411     0.1412              0.075    0.833     0.001
c          0.2379     0.2384              0.069    0.752     0.003
d          0.1809     0.1809              1.104    0.176    -0.068
e          0.0600     (none)              0.683    0.139     0.003